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Abstract 

The coalescent revolutionised theoretical population genetics, simplify- 
ing, or making possible for the first time, many analyses, proofs, and 
derivations, and offering crucial insights about the way in which the 
structure of data in samples from populations depends on the demo- 
graphic history of the population. However statistical inference under 
the coalescent model is extremely challenging, effectively because no ex- 
plicit expressions are available for key sampling probabilities. This led 
initially to approximation of these probabilities by ingenious application 
of modern computationally-intensive statistical methods. A key break- 
through occurred when Li and Stephens introduced a different model, 
similar in spirit to the coalescent, for which efficient calculations are 
feasible. In turn, the Li and Stephens model has changed statistical in- 
ference for the wealth of data now available which documents molecular 
genetic variation within populations. We briefly review the coalescent 
and associated measure- valued diffusions, describe the Li and Stephens 
model, and introduce and apply a generalisation of it for inference of 
population structure in the presence of linkage disequilibrium. 
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1 Introduction 

John Kingman made a number of incisive and elegant contributions to 
modelling in the field of genetics, several of which are described elsewhere 
in this volume. But it is probably the coalescent, or 'Kingman coalescent' 
as it is often known, which has had the greatest impact. Several authors 
independently developed related ideas around the same time [2U], [S3], 
|30| but it was Kingman's description and formulation, together with his 
proofs of the key robustness results [33] , [33J , [32] which had the greatest 
impact in the mathematical genetics community. 

More than 25 years later the coalescent remains central to much of 
population genetics, with book-level treatments of the subject now avail- 
able [54] . |2"5] , 55J. As others have noted, population genetics as a field 
was theory rich and data poor for much of its history. Over the last five 
years this has changed beyond recognition. The development and wide 
application of high-throughput experimental techniques for assaying mo- 
lecular genetic variation means that scientists are now awash with data. 
Enticing as this seems, it turns out that the coalescent model cannot be 
fully utilised for analysis of these data — it is simply not computation- 
ally feasible to do so. Instead, a closely related model, due to Li and 
Stephens, has proved to be computationally tractable and reliable as a 
basis for inference for modern genetics data. Alternative approaches are 
based on approximate inference under the coalescent. 

Our purpose here is to give a sense of these developments, before 
describing and applying an extension of the Li and Stephens model to 
populations with geographical structure. We do not attempt an extensive 
review. 

The initial historical presentation is necessarily somewhat technical in 
nature, and provides some explanation of the theoretical developments 
leading up to the Li and Stephens model. Readers interested in just the 
Li and Stephens model and its application may begin at Section [4] as the 
presentation of this material does not heavily depend on the previous 
sections. 

Before doing so, one of us (PD) will indulge in a brief personal remin- 
iscence. John Kingman was my doctoral supervisor. More accurately, 
he acted as my supervisor for a year, before leaving Oxford to Chair 
the then UK Science and Engineering Research Council (I have always 
hoped the decision to change career direction was unrelated to his su- 
pervisory experiences). During the year in question, Kingman wrote his 
three seminal coalescent papers. A photocopy of one of the manuscripts, 
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in John's immaculate handwriting, unblemished by corrections or second 
thoughts as to wording, still survives in my filing cabinet. 

There is a certain irony to the fact that although the coalescent was the 
unifying theme of much of my academic work for the following 20 years, it 
formed no part of my work under Kingman's supervision. John's strategy 
with new research students, or at least with this one, was to direct 
them to the journals in the library, note that some of the papers therein 
contained unsolved problems, and suggest that he would be happy to 
offer advice or suggestions if one were stuck in solving one of these. This 
was daunting, and the attempts were largely unsuccessful. It was only as 
Kingman was leaving Oxford that he passed on copies of the coalescent 
manuscripts, and, embarrassingly, it was some years before I saw the 
connection between the coalescent and aspects of my doctoral work on 
interacting particle systems, then under Dominic Welsh's supervision. 



2 The coalescent and the Fleming— Viot process 

To set the scene, we briefly outline the context in which the coalescent 
arises, and then describe the coalescent itself along with the correspond- 
ing process forward in time, the so-called Fleming- Viot measure-valued 
diffusion. We aim here only to give a brief flavour of the two processes 
rather than a detailed exposition. 

The most basic, and oldest, models in population genetics are finite 
Markov chains which describe the way in which the genetic composi- 
tion of the population changes over time. In most cases, these models 
are not tractable, and interest moves to their limiting behaviour as the 
population size grows large, under suitable re-scalings of time. When 
examined forward in time, this leads to a family of measure- valued diffu- 
sions, called Fleming- Viot processes. In a complementary, and for many 
purposes more powerful, approach one can instead look backwards in 
time, and focus on the genealogical tree relating sampled chromosomes. 
In the large population limit, these (random) trees converge to a partic- 
ular process called the coalescent. 

We start with the simplest setting in which individuals are haploid; 
that is they carry a single copy of their genetic material which is inher- 
ited from a single parent. Many organisms, including humans, are dip- 
loid, with each cell carrying two copies of the individual's DNA — these 
copies being inherited one from each of the individual's two parents. It 
turns out that the haploid models described below also apply to dip- 
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loid organisms provided one is interested in modelling the evolution of 
small contiguous segments of DNA — for many purposes we can ignore 
the fact that in diploid organisms these small segments of DNA occur 
in pairs in individuals, and instead model the population of individual 
chromosomes, or more precisely of small segments of them taken from 
the same genomic region, in each individual. In what follows, we will 
respect this particular perspective and refer to the haploid 'individuals' 
in the population we are modelling as 'chromosomes'. 

One simple discrete model for population demography is the Wright- 
Fisher model. Consider a population of fixed size N chromosomes which 
evolves in discrete generations. The random mechanism for forming the 
next generation is as follows: each chromosome in the next generation 
chooses a chromosome in the current generation (uniformly at random) 
and copies it, with the choices made by different chromosomes being 
independent. An equivalent description is that each chromosome in the 
current generation gives rise to a random number of copies in the next 
generation, with the joint distribution of these offspring numbers being 
symmetric multinomial. 

In addition to modelling the demography of a population, a population 
genetics model needs to say something about the genetic types carried by 
the chromosomes in the population, and the way in which these change 
(probabilistically) when being copied from parental to offspring chro- 
mosomes. Formally, this involves specifying a set, E, of possible types 
(usually, if unimaginatively, called the type space), and a matrix of trans- 
ition probabilities Y whose i, jth entry, 7^ , specifies for each i, j g E, 
the probability that an offspring chromosome will be of type j when the 
parental chromosome is of type i. The generality has real advantages: 
different choices of type space E can be used for modelling different kinds 
of genetic information. In most genetic contexts, offspring are extremely 
likely to have the same type as their parent, with changes to this type, 
referred to as mutations, being extremely rare. 

Under an assumption of genetic neutrality, all variants in a popula- 
tion are equally fit and are thus equally likely to be transmitted. This 
assumption allows a crucial simplification: the random process describing 
demography is independent of the genetic types carried by the individu- 
als in the population. In this case, one can first generate the demography 
of the population using, say, the Wright -Fisher model, and then inde- 
pendently superimpose the genetic type for each chromosome, and the 
details of the (stochastic) mutation process which may change types. 
The separation of demography from genetic types lies at the heart of 
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the simplification offered by the coalescent: the coalescent models the 
parts of the demography relevant to the population at the current time; 
information about genetic types can be added independently. The ex- 
tent to which the neutrality assumption applies is rather controversial in 
general, and for humans in particular, but it seems likely that it provides 
a reasonable description for many parts of the genome. 

The Wright-Fisher model may also be extended to allow for more 
realistic demographic effects, including variation in population size, and 
geographical spatial structure in the population (so that offspring chro- 
mosomes are more likely to be located near to their parents). We will 
not describe these here. Somewhat surprisingly, it transpires that the 
simple model described above, (constant population size, random mat- 
ing, and neutrality — the so-called 'standard neutral' model), or rather 
its large population limit, captures many of the important features of 
the evolution of human and other populations. There is an aphorism in 
statistics that "all models are false, but some are useful" . The standard 
neutral model has proved to be extremely useful. 

In a Wright-Fisher, or any other, model, we could describe the ge- 
netic composition of the population at any point in time by giving a list 
of the genetic types currently present, and the proportion of the popu- 
lation currently of each type. Such a description corresponds to giving 
a probability measure on the set E of possible types. It is sometimes 
helpful to think of this measure as the distribution of the type of an 
individual picked at random from the population. Note that summar- 
ising the population composition in this way at a particular time point 
involves an assumption of exchangeability across individuals: it is only 
the types present, and the numbers of individuals of each type, in a par- 
ticular population which matter, with information about precisely which 
individuals carry particular types not being relevant. In this framework, 
when we add details of the mutation process to the Wright-Fisher model, 
by specifying E and T, we obtain a discrete time (probability-) measure- 
valued Markov process. As N becomes large a suitable rescaling of the 
process converges to a diffusion limit: time is measured in units of N 
generations, and mutation probabilities, the off-diagonal entries of the 
matrix T above, are scaled as TV" 1 . For general genetic systems, the 
limit is naturally formulated as a measure-valued process, called the 
Fleming- Viot diffusion. The classical so-called Wright-Fisher diffusion 
is a one-dimensional diffusion on [0, 1] which arises when there are only 
two possible genetic types and one tracks the population frequency of 
one of the types. This is a special case of the Fleming- Viot diffusion, 



G 



Peter Donnelly and Stephen Leslie 



in which we can identify the value of the classical diffusion, p £ [0, 1], 
with a probability measure on a set with just two elements. The beauty 
of the more general, measure- valued, formulation is that it allows much 
more complicated genetic types, which could track DNA sequences, or 
more exotically even keep track of the time since particular mutations 
arose in the population. 

The Fleming- Viot process can thus be thought of as an approximation 
to a large population evolving according to the Wright Fisher model. 
As noted, for the Wright-Fisher model, time is measured in units of N 
generations in this approximation (and the approximation applies when 
mutation probabilities are of order iV _1 ). In fact the Fleming- Viot pro- 
cess arises as the limit of a wide range of demographic models (and we 
refer to such models as being within the domain of attraction of the 
Fleming- Viot process), although the appropriate time scaling can dif- 
fer between models. (See, for example, [13].) For background, including 
explicit formulations of the claims made above, see [JO], [JJJ [13]; [13], 

Donnelly and Kurtz [TU], [UJ give a discrete construction of the 
Fleming-Viot process. As a consequence, the process can actually be 
thought of as describing the evolution of a hypothetically infinite pop- 
ulation, and it explicitly includes the demography of that population. 
Exchangeability figured prominently in Kingman's work in genetics. It 
provides a linking thread here: the Donnelly-Kurtz construction em- 
beds population models for each finite population size N in an infinite 
exchangeable sequence. The value taken by the Fleming-Viot diffusion 
at a particular time point is just the de Finetti representing measure for 
the infinite exchangeable sequence. Given the value of the measure, the 
types of individuals in the population are independent and identically 
distributed according to that measure. 

The coalescent arises by looking backwards in time. Consider again the 
discrete Wright-Fisher model. If we consider two different chromosomes 
in the current generation, they will share an ancestor in the previous 
generation with probability 1/N. If not, they retain distinct ancestries, 
and will share an ancestor in the generation before that with probability 
1/N. The number of generations until they share an ancestor is thus 
geometrically distributed with success probability 1/iV and mean N. In 
the limit for large N, with time measured in units of N generations, this 
geometric random variable converges to an exponential random variable 
with mean 1. 

More generally if we consider k chromosomes, then for fixed k and 
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large N, they will descend from k distinct ancestors in the previous 
generation with probability 

Exactly two will share a common ancestor in the previous generation 
with probability (*) jf + 0(N~ 2 ), and more than a single pair will share 
a common ancestor with probability 0(N~ 2 ). In the limit as N — > oo, 
with time measured in units of N generations, the time until any of the 
k share an ancestor will be exponentially distributed with mean 1/ ( 2 ) , 
after which time a randomly chosen pair of chromosomes will share an 
ancestor. 

Thus, in the large population limit, with time measured in units of 
N generations, the genealogical history of a sample of size n may be 
described by a random binary tree. The tree initially has n branches, for 
a period of time T n , after which a pair of branches (chosen uniformly at 
random, independently of all other events) will join, or coalesce. More 
generally, the times 7fc, k — n, n — 1, . . . , 2 for which the tree has k 
branches are independent exponential random variables with 

E(2W=(*) -1 , 

after which a pair of branches (chosen uniformly at random independ- 
ently of all other events) will join, or coalesce. The resulting random 
tree is called the n-coalescent, or often just the coalescent. Note that 
we have described the coalescent as a random tree. Kingman's original 
papers elegantly formulated the n-coalescent as a stochastic process on 
the set of equivalence relations on {1,2, ... ,n}. The two formulations 
are equivalent. We view the tree description as more intuitive. 

In a natural sense the tree describes the important part of the gene- 
alogical history of the sample, in terms of their genetic composition. 
It captures their shared ancestry, due to the demographic process. As 
noted above, a key observation is that in neutral models the distribution 
of this ancestry is independent of the genetic types which happen to be 
carried by the individuals in the population. Probabilistically, one can 
thus sample the coalescent tree and then superimpose genetic types. For 
example, at stationarity, first choose a type for the most recent common 
ancestor of the population (the type at the root of the coalescent tree) 
according to the stationary distribution of the mutation process, and 
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then track types forward through the tree from the common ancestor, 
where they will possibly be changed by mutation. 

The preceding recipe gives a simple means of simulating the genetic 
types of a sample of size n from the population. Note that this is an 
early example of what has more recently come to be termed 'exact sim- 
ulation': a finite amount of simulation producing a sample with the exact 
distribution given by the stationary distribution of a Markov process. In 
addition, it is much more computationally efficient than simulating the 
entire population forward in time for a long period and then taking a 
sample from it. Finally, it reveals the complex structure of the distribu- 
tion of genetics models at stationarity — the types of each of the sampled 
chromosomes are (positively) correlated, exactly because of their shared 
ancestral history. 

We motivated the coalescent from the Wright -Fisher model, but the 
same limiting genealogical tree arises for any of the large class of demo- 
graphic models in the domain of attraction of the Fleming- Viot diffusion. 
See Kingman [34 for an elegant formulation and proof of this kind of 
robustness result. Moreover, the ways in which the tree shape changes 
under different demographic scenarios (e.g. changes in population size 
or geographical population structure) is well understood [55] . [28] . 

The discrete construction of the Fleming- Viot process described above 
actually embeds the coalescent and the forward diffusion in the same 
framework, so that one can think of the coalescent as describing the 
genealogy of a sample from the diffusion. 

There is even a natural limit, as n — > oo, of the n-coalescents, intro- 
duced and studied by Kingman [32] . This can be thought of as the limit 
of the genealogy of the whole population, or as the genealogy of the 
infinite population described by the Fleming- Viot process (though this 
perspective was not available when Kingman introduced the process). 
The analysis underlying the relevant limiting results for this population- 
genealogical process is much more technical than that outlined above for 
the fixed-sample-size case [5], 0, [10] • It is easiest to describe this tree 
from the root, representing the common ancestor of the population, for- 
ward to the tips, each of which represents an individual alive at the 
reference time. The tree has k branches for a random period of time 
Tk, after which a branch, chosen uniformly at random, independently 
for each fc, splits to form two branches. The times T/c, k = 2, 3, . . . , 
are independent exponential random variables, and independent of the 
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topology of the tree, with 
Write 

OO 

fe=i 

for the total depth of the tree, or equivalently for the time back until 
the population first has a common ancestor. Note that T is a.s. finite. 
In fact E(T) = 2. 

To date, we have focussed on models for a small segment of DNA. For 
larger segments, in diploid populations, one has to allow for the process 
of recombination. Consider a particular human chromosome inherited 
from a parent. The parent will have two (slightly different) copies of this 
chromosome. Think of the process which produces the chromosome to 
be passed on to the offspring as starting on one of the two chromosomes 
in the parent and copying from it along the chromosome. Occasionally, 
and for our purposes randomly, the copying process will 'cross over' to 
the other chromosome in the parent, and then copy from that, perhaps 
later jumping back and copying from the original chromosome, and so 
on. The chromosome passed on to the offspring will thus be made up 
as a mosaic of the two chromosomes in the parent. The crossings over 
are referred to as recombination events. In practice, these recombination 
events are relatively rare along the chromosome: for example in humans, 
there will typically be only a few recombination events per chromosome. 

The formulation described above can be extended to allow for re- 
combination. In the coalescent framework, the consequence is that in 
going backwards in time, different parts of a chromosome may be in- 
herited from different chromosomes in the previous generation. One way 
of conceptualising this is to imagine each position in the DNA as hav- 
ing its own coalescent tree, tracing the ancestry of the DNA in that 
position. This coalescent tree, marginally, will have the same distribu- 
tion as the coalescent. As one moves along the DNA sequence, these 
trees for different positions are highly positively correlated. In fact, two 
neighbouring positions will have the same tree iff there is no recombin- 
ation event between those positions, on a lineage leading to the current 
sample, since their joint most recent common ancestor. If there is such 
a recombination, the trees for the two positions will be identical back 
to that point, but (in general) different before it. The correlation struc- 
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ture between the trees for different positions is complex. For example, 
when regarded as a process on trees as one moves along the sequence, 
it is not Markov. Nonetheless it is straightforward to simulate from the 
relevant joint distribution of trees, and hence of sampled sequences. The 
trees for each position can be embedded in a more general probabil- 
istic object (this time a graph rather than a tree) called the ancestral 
recombination graph [21], [22] . 



3 Inference under the coalescent 

The coalescent has revolutionised the way we think about and analyse 
population genetics models, and changed the way we simulate from these 
models. There are several important reasons for this. One is the separ- 
ation, for neutral models, of demography from the effects of mutation. 
This means that many of the properties of samples taken from genetics 
models follow from properties of the coalescent tree. A second reason 
is that the coalescent has a simple, indeed beautiful, structure, which 
is amenable to calculation. Most questions of interest in population ge- 
netics can be rephrased in terms of the coalescent, and the coalescent 
is a fundamentally simpler process than the traditional forwards-in-time 
models. 

The body of work outlined above has an applied probability flavour; 
some of it more applied (for example solving genetics questions of in- 
terest), and some more pure (for example the links with measure- valued 
diffusions). Historically, much of it occurred in the 10-15 years after 
Kingman's coalescent papers, but even today 'coalescent methods' as 
they have become known in population genetics, are central to the ana- 
lysis of genetics models. 

If an applied probability perspective prevailed over the first 10-15 
years of the coalescent's existence, the last 10-15 years have seen a par- 
allel interest in statistical questions. Since the early 1990s there has 
been a steady growth in data documenting molecular genetic variation 
in samples taken from real populations. Over recent years this has be- 
come a deluge, especially for humans. Instead of trying to study prob- 
abilistic properties of the coalescent, the statistical perspective assumes 
that some data come from a coalescent model, and asks how to do stat- 
istical inference for parameters in the model, or comparisons between 
models (for example arising from different demographic histories for the 
population). 
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There have been two broad approaches. One has been to attempt to 
use all the information in the data, by basing inference (in either a fre- 
quentist or Baycsian framework) on the likelihood under the model: the 
probability, regarded as a function of parameters of interest, of observing 
the configuration actually observed in the sample. This is the thread we 
will follow below. Full-likelihood inference under the coalescent turns out 
to be a difficult problem. A second approach has been to summarise the 
information in the data via a small set of summary statistics, and then 
to base inference on these statistics. One particular, Bayesian, version of 
this approach has come to be called approximate Bayesian computation 
(ABC): one approximates the full posterior distribution of parameters of 
interest conditional on the data by their conditional distribution given 
just the summary statistics [2j. 

Full-likelihood inference under the coalescent is not straightforward, 
for a simple reason. Although the coalescent enjoys many nice properties, 
and lends itself to many calculations, no explicit expressions are avail- 
able for the required likelihoods. There is one exception to this, namely 
settings in which the mutation probabilities, 7^, that a chromosome is 
of type j when its parent is of type i, depend only on j. This so-called 
parent-independent mutation is unrealistic for most modern data, not- 
withstanding the fact that any two-allele model (that is, when the type 
space E consists of only two elements) can be written in this form. For 
parent-independent mutation models, the likelihood is multinomial. 

In the absence of explicit expressions for the likelihood, indirect ap- 
proaches, typically relying on sophisticated computational methods, 
were developed. Griffiths and Tavare were the pioneers [53], [23], [25] . 
|26j . They devised an ingenious computational approach whereby the 
likelihood was expressed as a functional of a Markov chain arising from 
systems of recursive equations for probabilities of interest. Felsenstein 
[T§] later showed the Griffiths-Tavare (GT) approach to be a particu- 
lar implementation of importance sampling. In contrast to the GT ap- 
proach, Felsenstein and colleagues developed Markov chain Monte Carlo 
(MCMC) methods for evaluating coalescent likelihoods. These were not 
without challenges: the space which MCMC methods explored was effect- 
ively that of coalescent trees, and thus extremely high-dimensional, and 
assessment of mixing and convergence could be fraught. As subsequent 
authors pointed out, failure of the Markov chains to mix properly resul- 
ted in poor approximations to the likelihood |18] . 

Donnelly and Stephens [49] adopted an importance sampling ap- 
proach. They reasoned that if the GT approach was implicitly doing 



12 



Peter Donnelly and Stephen Leslie 



importance sampling, under a particular proposal distribution which 
arose automatically, then it might be possible to improve performance 
by explicitly choosing the proposal distribution. In particular, they noted 
that the optimal proposal distribution was closely related to a particular 
conditional probability under the coalescent, namely the probability that 
an additional, n + 1st sampled chromosome will have type j conditional 
on the observed types in a sample of n chromosomes from the popula- 
tion. This conditional probability under the coalescent does not seem to 
be available explicitly (except under the unrealistic parent-independent 
mutation assumption) — indeed an explicit expression for this probability 
leads naturally to one for the required likelihoods, and conversely. 

Donnelly and Stephens exploited the structure of the discrete repres- 
entation of the Fleming- Viot diffusion to approximate the key condi- 
tional probabilities. In effect, in the Donnelly-Kurtz process they fixed 
the types on the first n levels and ran the level n+1 process. This led nat- 
urally to an approximation to the conditional distribution of the n + 1st 
sampled chromosome given the types of the first n chromosomes, which 
in turn leads naturally to importance sampling proposal distributions. 
As had been hoped, importance sampling under this family of proposal 
distributions was considerably more efficient than under the GT scheme 

S3- 

There has been considerably more activity in the area of inference 
under the coalescent over the last 10 years. We will not pursue this here, 
as our narrative will take a different path. Interested readers are referred 
to [51] and [H]. 



4 The Li and Stephens model 

As we have noted, statistical inference under the coalescent is hard. From 
our perspective, a key breakthrough came earlier this decade from Li and 
Stephens |36j . Their idea was very simple, and it turns out to have had 
massive impact. Li and Stephens argued that instead of trying to do 
inference under the coalescent one should appreciate that the coalescent 
is itself only an approximation to reality, and that one might instead do 
inference under a model which shares many of the nice properties of the 
coalescent but also enjoys the additional property that full likelihood 
inference is straightforward. 

Li and Stephens changed the model. Inference then became a tractable 
problem. What matters is how good these inferences are for real data 
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sets. Although not obvious in advance, it turns out that for a very wide 
range of questions, inference under the Li and Stephens model works 
well in practice. 

A forerunner to the Li and Stephens approach arose in connection 
with the problem of estimating haplotype phase from genotype data. 
Stephens, Smith, and Donnelly [55] introduced an algorithm, PHASE, in 
which the conditional distribution underpinning the Donnelly-Stephens 
importance-sampling proposal distribution was used directly in a pseudo 
Gibbs sampler. PHASE has been widely used, and even today provides one 
of the most accurate methods for computational recovery of haplotype 
phase. (Several recent approaches aim to speed up computations to allow 
phasing of genome-wide data sets, typically at a slight cost in accuracy 
e.g. Beagle [5J, @]; FastPhase [46]; and IMPUTE 2 [29]. See [39] for a 
review of some of these methods.) 

We now describe the Li and Stephens model. For most modern data 
sets it is natural to do so in the context of 'SNPs'. A SNP, or single 
nucleotide polymorphism, is a position in the DNA sequence which is 
known to vary across chromosomes. At the overwhelming majority of 
SNPs there will be exactly two variants present in a population, and we 
assume this here. For ease, we will often code the variants as and 1. 
To simplify the description of the model we assume haplotype data are 
available. This is equivalent to knowing the types at each SNP separ- 
ately along each of the two chromosomes in a diploid individual. (Most 
experimental methods provide only the unordered pair of types on the 
two chromosomes at each SNP, without giving the additional informa- 
tion as to which variant is on which chromosome. As noted above, there 
are good statistical methods for estimating the separate haplotypes from 
these genotype data.) 

It is convenient, and for many purposes most helpful, to describe the 
Li and Stephens model via the conditional probabilities it induces, and 
in particular by specifying the probability distribution for the n + 1st 
sampled chromosome given the types of the first n chromosomes. This 
in turn can be described by a recipe for simulating from this conditional 
distribution. (We return below to a probabilistic aside on this perspect- 
ive.) 

In effect, the Li and Stephens model simulates the n+lst chromosome 
as an imperfect mosaic of the first n chromosomes. To simulate the n+lst 
chromosome, first pick one of the existing n chromosomes at random. 
At the first SNP copy the type from the chosen chromosome, but with 
random 'error' in a way we will describe below. With high probability 
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(specified below) the second SNP will be probabilistically copied from 
the same chromosome. Alternatively, the chromosome for copying at 
the second SNP will be re-chosen, uniformly and independently. Having 
simulated the type on the n + 1st chromosome at the fcth SNP, the 
k + 1st SNP will be copied from the same chromosome as the fcth SNP 
with high probability, and otherwise copied from a chromosome chosen 
independently, and uniformly at random, from the first n. It remains to 
specify the probabilistic copying mechanism: with high probability at a 
particular SNP, the value of the n+ 1st chromosome will be the same as 
that on the chromosome being copied, otherwise it will have the opposite 
type. 

The connection with the coalescent comes from the following. Con- 
sider the position of the first SNP on the n + 1st chromosome. Ignoring 
coalescences amongst the first n sampled chromosomes at this position, 
the ancestry of the n + 1st sampled chromosome coalesces with exactly 
one of the lineages leading to the n sampled chromosomes. Ignoring 
mutation, the type of the n + 1st chromosome at this position will be 
the same as the type on the chromosome with which its ancestry co- 
alesces. To incorporate mutation one allows mis-copying of the ancestral 
type. This mis-copying is an oversimplification of the effect of mutation 
on the coalescent tree at this position. Now, moving along the n + 1st 
chromosome, there will be a segment, up to the first recombination event 
in the relevant history, which shares the same ancestry (and so is copied 
from the same one of the sampled chromosomes) . The effect of recombin- 
ation is to follow different ancestral chromosomes and this is mimicked 
in the Li and Stephens approach by choosing a different chromosome 
from which to copy. The probabilities of this change will depend on the 
recombination rates between the SNPs, and in a coalescent also on n, 
because coalescence of the lineage of the n + 1st chromosome to one of 
the other lineages happens faster for larger n. 

We now describe the model more formally. Suppose that n chromo- 
somes (each of which can be thought of as a haplotype) have been 
sampled from the population, where the jth haplotype has the SNP 
information at I SNPs, cP = {cj, c^, . . . , }. Let us call this set of chro- 
mosomes C. Now suppose an additional chromosome i has been sampled 
and has SNP information h % = {h\, h\, ■ ■ ■ , h\}. We seek to determine 
the probability of sampling this chromosome, based on its SNP haplo- 
type and the SNP haplotypes of the previously sampled chromosomes 
C. The model takes as input fine-scale estimates of recombination rates 
in the region: r = {r ,ri, . . . , r/} where rj + \ — rj is the average rate 
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of crossover per unit physical distance per meiosis between sites j and 
j + 1 times the physical distance between them. We set ro = 0. We ob- 
tain this map from elsewhere (for example |31j ) rather than estimating 
it for ourselvetQ Note that the SNPs (and the map) are ordered by the 
position of the SNP (or map point) on the chromosome (for convenience 
we refer to the first SNP position as the leftmost position and the Ith 
SNP position as the rightmost). We define the per- locus recombination 
probability p s = 1 — exp(— AN e (r s+ i — r s )/n) and then define transition 
probabilities for a Markov chain on {1, 2, . . . , n} from state j (indicating 
that it is the jth haplotypc of those that have been previously sampled 
that is 'parental') at position s to state k at position s + 1: 

/. , \ Jl - Ps+ Ps/n, j = k, 
q(js,k 8 +i) = < , . . , (4.1) 

{Ps/n, j t k, 

where N e is the so-called effective population size, a familiar quantity 



in population genetics models. Equation (4.1) is related to the fact that 
recombination events occur along the sequence as a Poisson process. 
Here we use the Poisson rate AN e (r s+1 — r s )/n. Given the rate, the 
probability that there is no recombination between sites s and s + 1 
is exp(— 4A^ e (r s+ i — r s )/n) = 1 — p s . The probability of at least one 
recombination between sites s and s + 1 is thus p s . In this case the model 
has the assumption that it is equally likely that the recombination occurs 
with any of the n sampled haplotypes. In particular, as p s incorporates 
the probability that multiple recombinations occur between sites s and 



s + 1, the first case in Equation (4.1 ) includes a p s term to allow for the 



possibility that the same haplotype is parental at each site s and s + 1 
even when one or more recombinations have occurred. 

We define the copying probabilities in terms of the 'population muta- 
tion rate' for the given sample size (#, defined below), another familiar 
quantity from population genetics. The mismatch (or not) between the 
SNP allele of the jth 'parent' chromosome at SNP s, c J s , and the SNP 
allele of the ith additional 'daughter' chromosome, h\ : is defined as 




Notice that as 9 — > oo the alleles and 1 at any given site become 

1 In fact, Li and Stephens developed their model precisely for estimating the 
genetic map, but for our purposes we wish to utilize the model for inference in 
other settings and thus we utilize a known genetic map. 
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equally likely. Equation (4.2) is motivated by a similar coalescent argu- 
ment to that used for the transition probabilities above. In this case the 
probability of no mutations occurring at the site s is n/(n + 9) and thus 
the probability of at least one mutation occurring at s is 9/(n + 9). It is 
possible to allow for the population mutation rate to vary sitewise if it 
is necessary to account for known variable mutation rates. 

The particular form of the transition and copying probabilities in the 
model follow from the informal coalescent arguments given four para- 
graphs above ([3B]). We noted that the recombination probabilities typ- 
ically come from available estimates. It turns out that the accuracy of 
these estimates can be important in applications of the model. In con- 
trast, such applications are generally not especially sensitive to the exact 
value of 9 used. Thus, for the mutation probabilities, we follow Li and 
Stephens and set 



(4.3) 



We can view the Li and Stephens process as defining a path through 
the previously sampled sequences C. This is illustrated in Figure |4~T| 

A key feature of the conditional probabilities just described for the 
Li and Stephens model is that they take the form of a hidden Markov 
model (HMM) [IS]. The path through the sampled chromosomes is a 
Markov chain on the set {1, 2, . . . , n} which indexes these chromosomes: 
the value of the chain at a particular SNP specifies which chromosome 
is being copied at that SNP to produce the new chromosome. In the 
language of HMMs, the transition probabilities specify the probability 
of either continuing to copy the same chromosome or choosing another 
chromosome to copy, and the emission probabilities specify the prob- 
ability of observing a given value at the SNP on the new chromosome 
given its type on the chromosome being copied. The latter have the 
simple form that with high probability the new chromosome will just 
copy the type from the chromosome being copied, with the remaining 
probability being for a switch to the opposite type from that on the 
chromosome being copied. Viewed this way, the transition probabilities 
relate to the recombination process and the emission probabilities to the 
mutation process. The reason that the HMM structure is crucial is that 
very efficient algorithms are available for calculations in this context. For 
example, under the Li and Stephens model, given values for all the SNPs 
on the n + 1st chromosome, and good computational resources, one can 
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Figure 4.1 A pictorial representation of the Li and Stephens model for 
the simulation of a new haplotype. Here we have sampled sequences 
c 1 , . . . , c 6 and seek to simulate a new haplotype h l . The model first 
simulates a 'path' through the sampled sequences, indicated in the 
figure by the arrows, which show which of the c l is copied at each SNP 
locus. Recombination with another parental sequence is indicated by 
changing the sequence being copied (for example, between the second 
and third loci) . The dark box on the path at the fifth locus indicates 
that a mutation occurred in the copying of that locus. 



calculate the conditional probability of each possible path through the 
first n chromosomes, or of the maximum likelihood path (see |12j for 
details). 

Not only is the Li and Stephens model tractable, in a way that the 
coalescent is not, but it turns out that its use for inference in real pop- 
ulations has been very successful. Examples include inference of haplo- 
type phase [46] , [51] , [37] , [38] ; inference of fine-scale recombination rates 
[36] ; imputation of unobserved genotype data [40] , [47] ; and imputation 
of classical HLA types from SNP data [35] . It seems that the model cap- 
tures enough of the features of real data to provide a good framework 
for inference. Because inference under the coalescent is impossible, it is 
not known whether this would have properties which are better or worse 
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than those under Li and Stephens, though we would not expect major 
differences. 

We have specified the Li and Stephens model via its conditional prob- 
abilities, and these are what is crucial in most applications. But there 
is a probabilistic curiosity which is worth mentioning. One could calcu- 
late the probability, under the model, for a particular configuration of 
n chromosomes via these conditional probabilities: for a particular or- 
dering, this would simply be the product of the marginal probability for 
the first chromosome, the Li and Stephens conditional probability for 
the second chromosome given the first, the Li and Stephens conditional 
probability for the third given the first and second, and so on. In fact, 
Li and Stephens themselves referred to their model as the product of 
approximate conditionals, or PAC, likelihood. The probabilistic curios- 
ity is that in this formulation the likelihood, or equivalently sampling 
probabilities, in general depend on the order in which the chromosomes 
are considered. One could solve the problem by averaging over all pos- 
sible orders, or approximately solve it by averaging over many orders, 
the approach adopted by Li and Stephens. But for most applications, 
including the one we describe below, it is the conditional distributions 
which matter, either in their own right (e.g. [35]) or for use in what 
resembles a Gibbs sampler. This latter approach gives rise to another 
curiosity: one can write down, and implement, an algorithm using the Li 
and Stephens conditionals as if it were a Gibbs sampler, even though the 
conditionals do not obviously correspond to a proper joint distribution. 
These algorithms (of which PHASE was perhaps the first) often perform 
extremely well in practice, notwithstanding the gap in their probabil- 
istic pedigree, an observation which might warrant further theoretical 
investigation. 



5 Application: modelling population structure 

In this section we describe an extension of the Li and Stephens model 
appropriate for geographically structured populations, and then show 
how inference under the model performs well on real data. 

Real populations often consist of genetically distinct subpopulations, 
and there has been considerable interest in the development of stat- 
istical methods which detect such subpopulations, and assign sampled 
individuals to them, on the basis of population genetics data [3J, [44], 
[43] , [45] , [H] , [6] , [7] . Understanding population structure is of interest 
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in conservation biology, human anthropology, human disease genetics, 
and forensic science. It is important to detect hidden population struc- 
ture for disease mapping or studies of gene flow, where failure to detect 
such structure may result in misleading inferences. Population struc- 
ture is common amongst organisms, and is usually caused by subpop- 
ulations forming due to geographical subdivision. It results in genetic 
differentiation — frequencies of variants, (called alleles in genetics) that 
differ between subpopulations. This may be due to natural selection in 
different environments, genetic drift (stochastic fluctuations in popula- 
tion composition) in distinct subpopulations, or chance differences in the 
genetic make up of the founders of subpopulations |27] . 

Model-based approaches to detecting and understanding population 
structure have been very successful [45], [E], [6], [7j. Broadly, one spe- 
cifies a statistical model describing data from different subpopulations 
and then performs inference under the model. These models are ex- 
amples of the statistical class of mixture models, in which observations 
are modelled as coming from a (typically unknown) number of distinct 
classes. In our context the classes consist of the subpopulations, and 
what is required is a model for the composition of each, and the way 
in which these are related. Model-based approaches to understanding 
population structure have several natural advantages. Results are read- 
ily interpretable, and, at least for Bayesian inference procedures, they 
provide a coherent assessment of the uncertainty associated with the 
assignment of individuals to subpopulations, and the assessment of the 
number of populations in the sample. 

Where the data consist of SNPs taken from distinct regions of the 
genome it is natural to model these as being independent of each other, 
within subpopulations, and it remains only to model the frequency of 
the alleles at each SNP in each subpopulation, and the joint distribution 
of these across subpopulations. This approach was taken by Pritchard, 
Stephens and Donnelly |H>j and implemented in the program STRUCTURE 
which has been successfully used in a variety of applications. 

In some contexts the population in question arose from the mixing, 
or admixing as it is known in genetics, of distinct populations at some 
time in the relatively recent past. African- American populations are an 
example of admixture, in this case between Caucasian and African pop- 
ulations. Such admixture is well known to lead to correlations between 
SNPs over moderately large scales (say 5-10 million bases) across chro- 
mosomes, known as admixture linkage disequilibrium. Falush, Stephens 
and Pritchard [16] adapted the model underlying STRUCTURE to incor- 
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porate these correlations, with the new model also having been widely 
applied PS], 02]. 

SNPs which are close to each other on the chromosome exhibit correl- 
ations, known in genetics as linkage disequilibrium, due to their shared 
ancestral history. Both the coalescent with recombination and the Li 
and Stephens model explicitly capture these correlations for a single 
randomly-mating population. To employ a model-based approach to 
population structure for nearby SNPs with linkage disequilibrium, one 
needs to model these correlations within and between subpopulations. 
We introduce such a model below as a natural extension of the Li and 
Stephens model. 

For simplicity throughout, we describe our model assuming the hap- 
lotypes of the sampled individuals are known, so we assume that the 
phase of the data is known, either directly or by being inferred using 
a phasing method such as [52] or [46]. Given the accuracy of statistical 
phasing methods, this is a reasonable assumption in most cases, but as 
we note below, the assumption can be dropped, at some computational 
cost. 

Suppose we have DNA sequence data (SNPs) for N haploid individuals 
(or N phased haplotypes) sampled from K populations at L loci. Call 
these data H — {hi, . . . , hx}, where the index i represents individual 
i. Define the population assignment of individual i to be the discrete 
random variable Zi which can take values in {1, . . . , K}. In contrast to 
the previous methods, however, we make no assumption that the SNPs 
are independent (or merely loosely dependent as is the case for [16 ) 
i.e. we explicitly deal with linkage disequilibrium. In order to model 
linkage disequilibrium, by applying the model of Li and Stephens [36], 
we require a genetic map of the region covering the L loci. We assume we 
have such a map, obtained from other sources (e.g. the program LDhat 
[4"T] , [32] , Q] or from applying the Li and Stephens method for estimating 
recombination rates [36] ) . In specifying the model we assume we know 
the value of K. 

We wish to allocate sequences to populations based on the data H. 
As is natural in mixture models, we do so in a Bayesian framework, via 
Gibbs sampling (or to be more precise, as we note below, via pseudo- 
Gibbs sampling) over the unobserved allocation variables, the Zi. To do 
so we need to specify the conditional distribution of Zi given the data 
H and the values of the allocation variables Z for individuals other than 
individual i. 
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We specify the prior distribution on the Zi to be uniform, i.e. 

P(Zi = 1) = P(Z< = 2) = • • • = P(Zi =K) = jf 

for every i = 1, . . . , N. It is a simple matter to include an informative 
prior if required. Furthermore, we assume that in the absence of any 
data hi, Zi is independent of all of the Zj and hj for i =/= j. 

We first informally describe the model in the special case in which 
the ancestry of any chromosome only ever involves chromosomes in its 
own population. In effect this means that there has been no migration 
between subpopulations over the time back to the common ancestors of 
sampled chromosomes. In this special case, we can think of calculating 
the conditional distribution as follows. (We describe the calculation of 
the conditional distribution of Z\ given H and the other Zi.) Suppose 
we know the current population assignments in our sampler of all of the 
haplotypes. We wish to update the assignment of haplotype 1 based on 
the assignments of all of the other haplotypes. First, we remove haplo- 
type 1 from the population it is currently assigned to. Then, we calculate 
the Li and Stephens conditional probability that h\ would be the next 
sampled haplotype from each of the K populations. We normalize these 
probabilities by dividing by their sum and draw the new population as- 
signment of haplotype 1 from the multinomial distribution with these 
normalized probabilities as the population weights. 

We now introduce the general version of our new model, which extends 
the simple Li and Stephens models above to incorporate migration or 
recent admixture, and use this as the basis for our method for classi- 
fying individuals into populations. Importantly, the model reduces to 
the Li and Stephens model in certain limiting cases. The method uses 
SNP haplotype data and explicitly models correlations between SNP 
loci using the model and estimates of recombination rates between the 
loci. We test our method on a sample of individuals from three contin- 
ents and show that it performs well when classifying individuals to these 
continental populations. 

We extend the Li and Stephens [36] model to incorporate admixture or 
inter-population migration by adding another parameter to the model, 
represented by a, which models the extent of shared ancestry between 
populations. This parameter attempts to capture the contribution to an 
individual's ancestry which is derived from a population other than its 
own. It does not have a natural interpretation in terms of the dynamics 
of the populations forward in time. In this sense it has more in common 
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with parameters in statistical models than in probability models. This 
extended model is perhaps best understood by considering the 'copying 
path' representation of the Li and Stephens model (see Figure 4.1 ). For 
simplicity we present the extended model for the case of two populations 
(i.e. K — 2) with a fixed value of a. We discuss further generalizations 
after the initial presentation of the model. 



As in the special case above, the model allows calculation of the 
probability that a particular haplotype is sampled from a given pop- 
ulation. Normalising these probabilities then gives the probabilities for 
resampling the allocation variables. 



The extension of Li and Stephens relates to recombination, or in the 
sense of Figure |4.1[ to the step when there is a potential change to the 
chromosome being copied. When such a potential change occurs in the 
simple model a new chromosome is chosen to be copied uniformly at 
random (including the chromosome currently being copied). In the ex- 
tension described above, this remains the case, but with the new chro- 
mosome allowed to be chosen only from the population being considered. 
Our generalisation allows the possibility that the copying process could 
switch to a chromosome in a different population, and the parameter a 
controls the relative probability of this. 



We now give the details of the model we use. Suppose that n\ and ni 
(where n\ + = n) chromosomes have been sampled from populations 
1 and 2 respectively, where the jth haplotype in the total population 
has the SNP information at I SNPs, cP = {cj, c|, . . . , c?}. Let us call 
this set of chromosomes C = C\ U C2, where C\ are the chromosomes 
sampled from Population 1, and C2 from Population 2. Now suppose an 
additional chromosome i has been sampled and has SNP information 
h l = {h\, h % 2, ■ ■ ■ , h\}. Without loss of generality we seek to determine 
the probability of sampling this chromosome from Population 1, based 
on its SNP haplotype and the SNP haplotypes of the previously sampled 
chromosomes C . We define the per locus recombination probability p s = 
1 — exp(— AN e (r s+ i — r s )/(ni + 0:1712)) and then define the transition 
probabilities from state j (indicating that it is the jth haplotype of 
those that have been previously sampled that is 'parental') at position 
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s to state k at position s + 1: 
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where iV e is the effective population size. Notice that when cti = 
the transition probabilities so defined reduce to the Li and Stephens 
transition probabilities for a single sample from Population 1. Also note 
that when ct\ = 1 we have effectively combined the two samples into 
a single sample from one population and again we obtain the Li and 
Stephens transition probabilities for this case. 

We define the emission probabilities in terms of the 'population muta- 
tion rate' for our given sample size, where in this case our 'sample size' 
is ni + ain 2 , 



Again, we obtain the desired Li and Stephens population mutation rates 
for the a\ = and ol\ — 1 cases. We then define the mismatch (or not) 
between the SNP allele of the jth 'parent' chromosome at SNP s, c 3 s , 
and the SNP allele of the ith additional 'daughter' chromosome, h\: 



As before, notice that these emission probabilities reduce to the ana- 
logous Li and Stephens emission probabilities for the cases «i = and 
Ct\ = 1. 

As was the case for a single population, we can view this process as 
defining a path through the previously sampled sequences C. This is 
illustrated in Figure [5~T] 

Using this model we proceed as before. To calculate the conditional 
probability of observing the additional haplotype we sum over all pos- 
sible paths through the potential parental chromosomes C, We use the 
forward algorithm which gives a computationally efficient means of per- 
forming the required summation |12j . For each of the n previously sam- 
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(5.3) 
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Figure 5.1 A pictorial representation of the calculation for a single 
path using the new model. Here we have sampled sequences c 1 , . . . , 
c 6 from two populations, represented by the first three rows and the 
second three rows in the figure, respectively. We seek to calculate the 
probability of sampling h l (seventh row) from the first population 
by summing the probabilities obtained over all possible paths. We 
illustrate this with a single path. The arrows indicate the 'path' taken 
through the sampled sequences, indicating which of the c % has the 
'parental' type at a given SNP locus. Recombination with another 
parental sequence is indicated by changing the sequence being copied 
(for example, between the second and third loci). The dark box on 
the path at the fifth locus indicates that a mutation occurred in the 
copying of that locus. A dashed arrow indicates that the copying at 
the next position is taking place with a sequence not in the population 
we are interested in (in this case the first population) and thus the 
recombination term in the model is scaled by the a parameter for 
these terms. 



pled chromosomes, we initialise the forward algorithm: 
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The forward algorithm moves along the sequence such that at each SNP 
s, where 1 < s < I, 

n 

H = e(K, cj) ]T fs-l x l(ks-i,Js)- (5.5) 

k=l 

The probability of observing the SNP configuration of the additional 
chromosome is given by 

n 

*(.ti\C,6,p)=Y i ff. (5.6) 

A simple extension for K > 2 treats the current population of interest 
as one population and all other populations are combined to form the 
second population, thus reducing this case to the K = 2 case. This is 
what we consider here. To further generalize the model one may define 
a matrix of a parameters for each ordered pair of populations. Infer- 
ence for the number of populations represented in the sample (K) may 
be performed using the method of [35] or by other methods. We focus 
here only on the problem of classifying individuals where the number of 
subpopulations is known. 

As noted above, it is a slight abuse of terminology to refer to the 
process just described as Gibbs sampling. In this case, we cannot be 
certain that our approximations do in fact correspond to a proper joint 
probability. Nonetheless, there are successful precedents for this use of 
'pseudo- Gibbs' sampling, notably the program PHASE [52"] . [50] which has 
proved successful for inferring haplotypes from genotypes. Furthermore, 
our method is successful in practice. Given these caveats, we continue 
our abuse of terminology throughout. 

It is also possible to set the proportion of the sample coming from 
each population as a parameter and update this value at each step. 
This would allow us to deal easily with populations of different sizes 
in a straightforward manner, although we have not implemented this 
extension. Extending the model to incorporate genotypes rather than 
haplotypes is also straightforward in principle, for example by analogy 

with gnj. 

Our model applies when SNPs are sampled from the same small chro- 
mosomal region. In practice, it is more likely that data of this type will 
be available for a number of different, separated, chromosomal regions. 
In this context we apply the model separately to each region and com- 
bine probabilities across regions multiplicatively. We implement this by 
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setting p s = 1 in Equation (5.1 1 for the transition from the last SNP in 
one region to the first SNP in the next region (the order of the regions 
is immaterial). 

We tested the method on phased data available from the Phase II 
HapMap [31] ■ In particular the data consisted of SNP haplotypes from 
samples from four populations: Yoruba ancestry from Ibadan in Nigeria 
(YRI); European ancestry from Utah (CEU); Han Chinese ancestry from 
Beijing (CHB); and Japanese ancestry from Tokyo (JPT). We use the 
SNP haplotype data available from the HapMap and the recombination 
map estimated in that study from the combined populations. The YRI 
and CEU HapMap samples are taken from 30 parent offspring trios in 
each case. Of these we used the data from the parents only, giving 60 
individuals (120 haplotypes) in each population. The Chinese and Japan- 
ese samples derive from 45 unrelated individuals (90 haplotypes) in each 
case. Following common practice, we combine the Chinese and Japanese 
samples into a single 'analysis panel', which we denote by CHB+JPT. 
Thus our sample consists of a total of 420 haplotypes deriving from three 
groups. 

From these data we selected 50 independent regions each of 50 kilob- 
ases in length (i.e. regions not in linkage disequilibrium with each other). 
Within these regions we selected only those SNPs that are variable in 
all three populations and have a minor allele frequency of at least 0.05. 
A summary of the data is given in Table |5.1| 

We then applied our new method to these data. In all cases we set 
the effective population size for our model, N e , to 15,000 and set a = 
oil — ct2 (this may be thought of as the simple situation when there is 
an equal amount of gene flow in each direction) . 

As a first test of using the new model we decided to test the sensitivity 
to the number of regions used and the number of SNPs selected across 
those regions, where we specify the correct number of populations in 
the sample [K = 3) and set a = 0.1. In our experiments we ignore the 
information as to which haplotypes come from which populations — this 
is what we aim to infer, and we can then compare inferences with the 
truth. We denote by r the number of independent regions chosen, where 
r £ {5,10,20,30,40,50}, and by s the number of SNPs chosen in the 
selected regions, where s £ {10, 20, 50, 80, 100}. For every pair of values 
r and s we independently at random selected r regions and then selected 
s SNPs from the selected region to be included in the analysis. We did 
this 20 times for each pair of values r and s and tested the method on 
each of the resulting samples from our data. 
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Table 5.1 Summary of the data: The data consist of HapMap 
samples for 420 haplotypes (120 CEU, 180 CHB+JPT, 120 
YRI) sampled from 50 independent regions of^50kb each, taken 
across all autosomes. SNPs with a minor allele frequency of less 
than 0. 05 in any of the population groups have been excluded. 

For each region the number of SNPs segregating in all 
populations is shown, as well as the region size, which is the 
distance between the first and last SNP in the region. 
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1 6 38600 1 22 68 47223 1 44 6 44916 

2 1 42 48726 2 23 56 49044 2 45 24 47963 

3 2 20 46100 3 24 42 48586 3 46 16 46999 

4 3 49 49031 4 25 38 47651 4 47 32 46961 

5 4 82 43511 5 26 56 44973 5 48 31 49698 

6 5 53 49377 6 27 139 49617 6 49 25 48619 

7 6 37 48387 7 28 34 47905 

8 7 86 49212 8 29 35 47465 

9 8 55 49700 9 30 34 49603 

10 9 59 49620 10 31 23 49827 

11 10 29 46888 11 32 25 49406 

12 11 76 49762 12 33 63 45207 

13 12 30 48056 13 34 53 49803 

14 13 23 49648 14 35 26 45723 

15 14 33 47289 15 36 26 44137 

16 15 85 47207 16 37 30 44109 

17 16 38 47062 17 38 42 48621 

18 17 61 48001 18 39 38 49694 

19 18 33 48359 19 40 38 47725 

20 19 44 48119 20 41 80 44893 

21 20 45 49195 21 42 77 47874 

22 21 1 22 43 8 23443 



In each run we used a burn-in of 200 iterations and then kept the 
following 1,000 iterations. We ran several analyses to confirm that these 
values are sufficient for convergence. Haplotypes were assigned to the 
cluster in which they spent the majority of iterations (after the burn- 
in). To check whether lab el- switching was occurring we kept the pairwise 
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assignments matrix for each run. As with [35] and [T5], label-switching 
was not observed and thus the clusters are well-defined. In order to 
assess performance, clusters were labelled by the population from which 
the majority of their assigned samples were derived. As we shall see, 
given the accuracy of the method, this does not result in any ambiguity. 
Performance was measured as the proportion of haplotypes assigned to 
the correct population, where we measured these proportions for each 
sample population individually, and also for the combined sample. We 
show the average performance over the 20 runs for each pair of values r 



and s in Figure 5.2 



Examination of Figure |5.2| reveals some insights about the perform- 
ance of the method. As would be expected, across all populations, for 
each fixed value of the number of regions selected (r), average perform- 
ance increases with the number of SNPs used. Thus, the more SNP in- 
formation we have, the more accurate are our population allocations. In 
general, increasing the number of independent regions used has less effect 
on the accuracy of the population assignments, although accuracy does 
increase with increasing the number of regions used. We conclude that 
applying our method to data sets comprising at least 80 SNPs derived 
from at least 10 independent regions will give high accuracy (> 95%) 
with good precision (the standard deviation over all runs observed in 
this case was less than 2%). 

We then tested the effect of varying the a parameter, for a in the 
range (0, 0.3). In this case we used a single set of SNPs for testing. We 
used a set of 50 SNPs derived from each of 10 regions which had given 
average accuracy over previous tests (approximately 90% of individu- 
als classified correctly in the initial tests). We selected this set as it 
represents a typical possible application of the method and also gives 
reasonable scope to examine the effect of a both in decreasing and in- 
creasing accuracy. For each value of a we ran the method 10 times with 
randomly chosen initial assignments, using a burn-in of 200 iterations 
and retaining 1,000 iterations after the burn-in. We observe that for a 
given value of a up to 0.2 the sampler converges to virtually the same 
result over all ten runs for all population groups. For a in this range, 
performance varies across populations but remains consistently above 
90% of total chromosomes assigned correctly. The best overall perform- 
ance is observed when a — 0.01. For values of a greater than 0.2 the 
method no longer converges and the average accuracy of assignments is 
reduced. The number of individuals classified correctly for various val- 
ues of a in the range to 0.2 differs by only a small amount, so it is 
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CEU: Proportion Correctly Allocated 
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Figure 5.2 Proportion of haplotypes assigned to the correct popula- 
tion. In this application of our method the number of populations is 
set to the correct number (K — 3), only SNPs with minor allele fre- 
quency (M AF) > 0.05 are included in the analysis, and a — 0.1. Each 
entry at position (r, s) in the four charts relates to a random selec- 
tion of r regions out of the 50 regions in the data, with s SNPs then 
randomly selected for each region. Values shown are averaged over 
20 runs of the method in each case, with a burn-in of 200 iterations 
and 1000 samples kept after the burn-in. Sequences are allocated to 
the cluster to which they are assigned for the majority of the iter- 
ations after the burn-in. The top left chart shows the proportion of 
CEU haplotypes assigned correctly to the CEU cluster. The second, 
third and fourth charts show the equivalent proportions for the YRI, 
CHB+JPT and Total Population respectively. 



difficult to draw too many conclusions from the limited experiments we 
have performed. Close examination of the posterior probabilities for all 
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individuals in the analyses may provide fruitful insights, although this 
is left to future work. 

In conclusion, the method performed well and, provided sufficient 
SNPs from sufficient independent regions are available for the haplotypes 
that are to be classified, gives better than 95% accuracy for classifying 
sequences into these continental populations. The method converges rap- 
idly: in most cases investigated, a burn-in of 200 iterations with a further 
1,000 iterations retained for analysis was seen to be sufficient, but we 
would advocate the use of more iterations than this. It is feasible to 
apply the method to data sets of the scale considered here. 

Note that there are natural, more sophisticated, approaches to hand- 
ling the uncertainty associated with assignment of individuals to pop- 
ulations. Long-run proportions of time spent in different assignments 
naturally lead to posterior probabilities for assignment. These could be 
carried through into subsequent analyses, or thresholded at some (high) 
value, so that population assignments are made only when these pos- 
terior probabilities are close to unity. 
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